The previous class session introduced decision trees as a fundamental machine learning algorithm. We also explored how spatially lagged attributes can be used as features in a predictive model as a means of accounting for or representing spillover effects or contagion processes. Recall that in a regression context, the presence of spatial autocorrelation in our data is problematic and something to be removed or, at the very least, grounds for treating any inferences with caution. On the other hand 👈 in a machine learning context, spatial autocorrelation is viewed as just another source of pattern in the data that we can harness to bolster our predictive accuracy 😃

This Assignment continues our foray into geospatial machine learning. The assigned readings for this week are available here and may be useful as you make your way through this activity:

Setup and Data Wrangling


First we load the necessary R packages and as usual, the necessary data have been placed in the data folder for you 🆒 As you read this week (and last week), the general aim of the two Steif chapters is to use observed home price data from Boston to train, apply, and evaluate a predictive model of home prices. However, we will pivot and follow a similar path using local Charlottesville data. We pull down the neighborhood boundaries layer from the Charlottesville open data portal, then apply a map projection to convert the data to a projected coordinate system. This Assignment also integrates data on reported crimes as a factor in predicting residential housing prices. The primary Charlottesville dataset is ready to go, but you should know that it does not contain all of the same variables as the Boston dataset from the readings. Instead, we have the following:

Eventually, we will add the following features as part of the Assignment:

The reported crimes data were taken from here and the remaining variables were pieced together by the instructor from Real Estate (Residential Details), Parcel Details 2, and Real Estate (Sales). Let’s see what we can do with these data!


# Get neighborhood polygons from the open data portal in GeoJSON format, 
# then apply a map projection...
nhoods_utm_sf <- 
  st_read("https://gisweb.charlottesville.org/cvgisweb/rest/services/OpenData_1/MapServer/97/query?outFields=*&where=1%3D1&f=geojson") %>%
  st_transform("EPSG:32617")
## Reading layer `OGRGeoJSON' from data source 
##   `https://gisweb.charlottesville.org/cvgisweb/rest/services/OpenData_1/MapServer/97/query?outFields=*&where=1%3D1&f=geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 68 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -78.52375 ymin: 38.00959 xmax: -78.44651 ymax: 38.07049
## Geodetic CRS:  WGS 84
cvilleHomes_utm <- st_read("./data/cvilleHomes_utm.geojson")
## Reading layer `cvilleHomes_utm' from data source 
##   `C:\Users\bw6xs\Documents\PLAN 6122\2023\Assignments\Lab Assignment 7\Lab Assignment 7\data\cvilleHomes_utm.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 4023 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 717392.7 ymin: 4209876 xmax: 723287.8 ymax: 4216385
## Projected CRS: WGS 84 / UTM zone 17N
# Collapse the many categories on the number of floors attribute, as we 
# did last week so there is a more manageable number of categories...
cvilleHomes_utm_sf <- 
  cvilleHomes_utm  %>%
  mutate(NmbrOfS.collapse = case_when(
                  NmbrOfS >= 0 & NmbrOfS < 3  ~ "Up to 2 Floors",
                  NmbrOfS >= 3 & NmbrOfS < 4  ~ "3 Floors",
                  NmbrOfS >= 4                   ~ "4+ Floors"),
         NmbrOfS.cat = as_factor(NmbrOfS.collapse)) %>%
  filter(SalAmnt > 0.0, 
         SqrFtFL > 0.0) %>% 
  mutate_at(c("Bedroms", "FllBthr", "HlfBthr", "Fireplc"), as.numeric) %>%
  mutate_at(c("Heating", "BsmntTy", "ExtrnlW"), as_factor)


# Read in the reported crimes information... 

cvilleCrimes <- st_read("./data/cvilleCrimes.shp")
## Reading layer `cvilleCrimes' from data source 
##   `C:\Users\bw6xs\Documents\PLAN 6122\2023\Assignments\Lab Assignment 7\Lab Assignment 7\data\cvilleCrimes.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 4188 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -78.53321 ymin: 38.00525 xmax: -78.4303 ymax: 38.07422
## Geodetic CRS:  WGS 84
cvilleCrimes_utm <- st_transform(cvilleCrimes,  crs = "EPSG:32617")


# Extract only reports of aggravated assault, then get rid of observations
# with missing values and any duplicate records (the map projection was
# already applied above)...
cvilleAssaults_utm_sf <-
  cvilleCrimes_utm %>%
    filter(Offense == "Assault Aggravated") %>%
    drop_na() %>%
    distinct()

# Note that the LENGTHUNIT for our sf objects is meters...
st_crs(cvilleHomes_utm_sf)
## Coordinate Reference System:
##   User input: WGS 84 / UTM zone 17N 
##   wkt:
## PROJCRS["WGS 84 / UTM zone 17N",
##     BASEGEOGCRS["WGS 84",
##         DATUM["World Geodetic System 1984",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4326]],
##     CONVERSION["UTM zone 17N",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-81,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["Between 84°W and 78°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Ecuador - north of equator. Canada - Nunavut; Ontario; Quebec. Cayman Islands. Colombia. Costa Rica. Cuba. Jamaica. Nicaragua. Panama. United States (USA)."],
##         BBOX[0,-84,84,-78]],
##     ID["EPSG",32617]]
# Create a quarter mile (402.336 meters) buffer around each housing unit
cvilleHomes_utm_quarter_mile_sf <- st_buffer(cvilleHomes_utm_sf, 402.336) 

# Now create a new attribute in the sf object with the number of reported
# aggravated assaults within that buffer...
cvilleHomes_utm_sf <- cvilleHomes_utm_sf %>%
  mutate(NumAggAssaults = lengths(st_intersects(cvilleHomes_utm_quarter_mile_sf, cvilleAssaults_utm_sf)))


# Visualize the location of housing units sold, reported aggravated assaults, 
# and the distance buffer we are using (i.e., in red)...
ggplot() + 
    geom_sf(data = nhoods_utm_sf, aes(fill = "white"), color = "grey25") + 
     geom_sf(data = cvilleAssaults_utm_sf, aes(col = "blue"), pch = 20, size = 2) + 
     geom_sf(data = cvilleHomes_utm_quarter_mile_sf[c(89, 302), ], color = "red", fill = NA) +
     geom_sf(data = st_centroid(cvilleHomes_utm_sf), fill = NA, aes(color = "grey75"), pch = 20, size = 1.2) + 
  scale_fill_manual(name = "Legend", values = c("white", "pink"), labels = c("Neighborhoods", "Example Buffer")) +
  scale_color_manual(name = "", values = c("blue", "grey75"), labels = c("Assaults", "Homes Sold")) +
  theme_void() +   
  theme(text = element_text(size = 20))
## Warning in st_centroid.sf(cvilleHomes_utm_sf): st_centroid assumes attributes
## are constant over geometries of x


Note that in the map above, the red circles are examples of the one-quarter mile buffer used to count the number of reported aggravated assaults near each home in our dataset. In some cases the count is zero and in others we can see that there are multiple blue points.


Leveraging Spatial Autocorrelation for Prediction


In the code chunk below, we use a new R package called sfdep to generate a spatial weights matrix, create a spatially lagged home price feature, and calculate spatial statistics such as the Global Moran’s I. A spatially lagged variable or feature is simply the weighted average of the sales prices of all homes that are considered “neighbors” of a particular home, an approach used frequently in spatial econometrics. If the spatial lag of sales prices turns out to be an important feature, that would suggest substantial clustering (e.g., Euclidean zoning, segregation based on race/ethnicity/income) in the distribution of residential real estate prices in our study area. This page provides a little more detail on spatial neighbors in R and even has an interactive app where you can explore how changing the way “neighboring” relationships are defined might affect the inferences we take away. If you need to use inverse distance weights, this page demonstrates how to generate them.
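To make the idea concrete before turning to sfdep, here is a minimal base-R sketch of a spatially lagged feature. The prices and neighbor lists below are invented toy values, not the Charlottesville data:

```r
# A self-contained sketch of a spatially lagged feature using base R only;
# the prices and neighbor lists are made-up illustrative values
prices <- c(250000, 310000, 275000, 420000, 180000)

# Hypothetical neighbor lists: home i's neighbors are nb[[i]]
nb <- list(c(2, 3), c(1, 3, 4), c(1, 2), c(2, 5), c(4))

# Row-standardized weights give each neighbor weight 1 / (number of
# neighbors), so the lag collapses to the mean of the neighbors' prices
lag_price <- sapply(seq_along(nb), function(i) mean(prices[nb[[i]]]))

lag_price[1]  # (310000 + 275000) / 2 = 292500
```

With row-standardized weights each neighbor contributes equally, so the spatial lag is just the mean of the neighbors' values, which is the same logic the 5-nearest-neighbors lag created in the next chunk follows.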


# Create a neighbor list for the homes sold based on each home's 5 nearest neighbors....
cvilleHomes_utm_centroids_sf <- st_centroid(cvilleHomes_utm_sf)
## Warning in st_centroid.sf(cvilleHomes_utm_sf): st_centroid assumes attributes
## are constant over geometries of x
cvilleHomes_knn <- sfdep::st_knn(cvilleHomes_utm_centroids_sf, k = 5)

# Create lagged version of sales price for each home sold that is the 
# weighted average of the sales prices of its 5 nearest neighbors...
cvilleHomes_utm_sf <- cvilleHomes_utm_sf %>%
  mutate(lagSalAmnt = sfdep::st_lag(x = .$SalAmnt, nb = cvilleHomes_knn,
                                wt = st_weights(cvilleHomes_knn, allow_zero = TRUE), allow_zero = TRUE))


# Plot sales price against lagged sales price, like the Steif chapter...
ggplot() + 
  geom_point(data = cvilleHomes_utm_sf, mapping = aes(x = lagSalAmnt, y = SalAmnt), fill = "orange", color = "white", 
             alpha = 0.25, pch = 21, size = 3, stroke = 1.2) +
      geom_smooth(data = cvilleHomes_utm_sf, method = "lm", mapping = aes(x = lagSalAmnt, y = SalAmnt), se = FALSE, 
                  color = "green4", alpha = 0.4) +
      labs(title = "Correlation between sale price \n and the spatial lag of sale price", 
           y = "Sale Amount ($)", x = "Mean of 5 Nearest Home Sale Prices ($)") + 
      scale_y_continuous(labels=scales::dollar_format()) +
      scale_x_continuous(labels=scales::dollar_format()) +
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5), 
             text = element_text(size = 20))
## `geom_smooth()` using formula = 'y ~ x'

# Say it another way....
print(str_c("Lagged sales price explains ", 
            round(((summary(
                lm(SalAmnt ~ lagSalAmnt, data = cvilleHomes_utm_sf))$adj.r.squared) * 100), 2),
            " percent of the variation in observed home sales prices.")
            )
## [1] "Lagged sales price explains 84.78 percent of the variation in observed home sales prices."
# Test for presence of spatial autocorrelation in home sales prices...
cvilleHomes_moran <- sfdep::global_moran_test(x = cvilleHomes_utm_sf$SalAmnt,  nb = cvilleHomes_knn,
                                wt = st_weights(cvilleHomes_knn, allow_zero = TRUE))

cvilleHomes_moran
## 
##  Moran I test under randomisation
## 
## data:  x  
## weights: listw    
## 
## Moran I statistic standard deviate = 71.208, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.8565050992     -0.0004458315      0.0001448273


In the above code chunk, the scatterplot shows that as sales price increases, so does the price of nearby houses—and vice versa. Creating a spatially lagged feature is one approach for leveraging spatial autocorrelation in a predictive model like this, but we could also use neighborhood boundaries as a proxy for similar spillover effects 😨

That chunk also performs a Global Moran’s I test for spatial autocorrelation using the sfdep::global_moran_test function. The vanishingly small p-value tells us the result is very unlikely to have arisen by chance, while the Moran’s I statistic of roughly 0.86 indicates strong, positive spatial autocorrelation (i.e., homes with high sales prices tend to be located near other homes with high sales prices).
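To demystify what sfdep::global_moran_test is computing, here is a base-R sketch of the Moran's I formula itself, I = (n / S0) * (z' W z) / (z' z), where z holds the mean-centered values and S0 is the sum of all weights. The values and binary weights matrix below are invented for illustration, not our home sales data:

```r
# Illustrative values with two spatial clusters; W is a binary contiguity
# matrix where observations 1-2-3 neighbor one another in a chain, as do 4-5-6
x <- c(10, 12, 11, 30, 28, 31)
W <- matrix(0, 6, 6)
W[1, 2] <- W[2, 1] <- 1
W[2, 3] <- W[3, 2] <- 1
W[4, 5] <- W[5, 4] <- 1
W[5, 6] <- W[6, 5] <- 1

z  <- x - mean(x)          # mean-centered values
S0 <- sum(W)               # sum of all weights
I  <- (length(x) / S0) * as.numeric(t(z) %*% W %*% z) / sum(z^2)
I  # roughly 0.91, i.e., strong positive spatial autocorrelation
```

Because similar values sit next to each other in this toy example, I lands near +1, just as the 0.86 statistic does for our home sales prices.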


# Link neighborhood attributes to home sales dataset...
cvilleHomes_utm_sf <- st_join(cvilleHomes_utm_sf, nhoods_utm_sf, join = st_within, left = TRUE)
dim(cvilleHomes_utm_sf)
## [1] 2244   26
# Now filter out unwanted observations and make the neighborhood
# attribute a factor rather than a character type...
cvilleHomes_utm_sf <- cvilleHomes_utm_sf %>%
  filter(SalAmnt > 0.0, 
         SqrFtFL > 0.0, 
         is.na(NeighHood) == FALSE) %>%
   mutate_at("NeighHood", as_factor)

dim(cvilleHomes_utm_sf)
## [1] 2040   26


Begin With A Single Decision Tree Model


The code chunk below partitions (divides) the dataset into two pieces—a training dataset with approximately 75 percent of the observations and a test dataset with the remaining approximately 25 percent. It also stratifies on neighborhood name. We use the caret::trainControl function to set up the standard k-fold cross-validation routine with k = 10, but it could just as easily be 5 folds, as illustrated in the image below.



As described in class, k-fold cross-validation can help us avoid overfitting and assists with tuning hyperparameters for a given machine learning algorithm. Cross-validation involves randomly dividing the data into k “slices” or folds of roughly the same size. The first fold is treated as a validation set and we train the model on the remainder of the data (i.e., the other 9 folds if k = 10). Next, we calculate the Mean Squared Error or another performance metric when the trained model is applied to the fold that was set aside. We repeat this process k times, with a different fold or “slice” acting as the validation set each time. This technique provides us with k estimates of the test error rate, as opposed to the single estimate we would otherwise get from a lone train/test split.
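The procedure just described can be sketched in a few lines of base R; the data here are simulated and lm() stands in for whichever learner caret would actually train:

```r
# A minimal sketch of the k-fold routine that caret::trainControl automates;
# the toy data and lm() model are illustrative stand-ins
set.seed(42)
n   <- 100
toy <- data.frame(x = runif(n))
toy$y <- 3 * toy$x + rnorm(n, sd = 0.25)

k <- 10
# Randomly assign each observation to one of k equally sized folds
fold <- sample(rep(1:k, length.out = n))

cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = toy[fold != i, ])   # train on the other k - 1 folds
  pred <- predict(fit, newdata = toy[fold == i, ])
  mean((toy$y[fold == i] - pred)^2)            # MSE on the held-out fold
})

length(cv_mse)  # k estimates of the test error, one per fold
mean(cv_mse)    # the cross-validated estimate of test MSE
```

caret averages these k fold-level errors for every candidate hyperparameter value and keeps the setting with the best cross-validated performance.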


As mentioned in class, the caret package currently supports over 230 algorithms and by setting the method argument of the caret::train function to “rpart”, we are fitting a simple decision tree model. Because the outcome we are predicting is continuous rather than categorical, it is a regression tree rather than a classification tree.


# This allows us to reproduce the results in the future...
set.seed(42)  

#################################################################################
# This function is part of the caret package and the y argument outlines
# how to stratify the data. From the documentation "the sample is split 
# into groups sections based on percentiles and sampling is done within 
# these subgroups" and the paste function simply concatenates several features. 
#
# This allows us to stratify on multiple features rather than just the outcome
# we are trying to predict (i.e., the training and test sets are more comparable)
#################################################################################
inTrain <- createDataPartition(
              y = cvilleHomes_utm_sf$NeighHood, 
              p = 0.75, list = FALSE)

cville_training <- cvilleHomes_utm_sf[inTrain, ] 
cville_testing <- cvilleHomes_utm_sf[-inTrain, ]  

cville_training %>%
  st_drop_geometry() %>%
  rstatix::get_summary_stats()
cville_testing %>%
  st_drop_geometry() %>%
  rstatix::get_summary_stats()
# Set up for standard, 10 fold cross-validation...
tree_ctrl <- trainControl(method = "cv", number = 10)

tree_model <- train(SalAmnt ~ SqrFtFL + Bedroms + FllBthr + Age + LotAcrs + Fireplc + NmbrOfS.cat + NumAggAssaults, 
                                  data = cville_training %>% 
                                  st_drop_geometry() %>%
                                  drop_na() %>%
                                  dplyr::select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, LotAcrs, Fireplc, 
                                                NmbrOfS.cat, NumAggAssaults, PrclNmb),
                    method = "rpart",
                    trControl = tree_ctrl,
                    tuneLength = 10)

print(tree_model)
## CART 
## 
## 1233 samples
##    8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1109, 1110, 1111, 1109, 1110, 1110, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.01209965  177914.7  0.6468659  118296.8
##   0.01425456  181864.8  0.6317297  120806.1
##   0.01492998  182582.1  0.6298142  120973.5
##   0.01934665  188500.5  0.6087911  124453.7
##   0.02263863  192809.3  0.5919239  127046.6
##   0.03178573  201489.9  0.5559143  132747.1
##   0.05747887  216585.0  0.5021451  138916.8
##   0.08552746  227052.4  0.4427115  145874.7
##   0.12130273  248144.8  0.3389001  164249.3
##   0.31859925  292737.6  0.2052107  188593.8
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01209965.


A Tool For Assessing Feature Importance


# Note the scale argument defaults to TRUE and the importance values are 
# normalized to range between 0 and 100...
caret::varImp(tree_model)
## rpart variable importance
## 
##                        Overall
## SqrFtFL                 100.00
## LotAcrs                  89.89
## Age                      85.86
## Bedroms                  79.13
## Fireplc                  73.56
## FllBthr                  41.09
## NumAggAssaults           26.02
## `NmbrOfS.cat4+ Floors`    0.00
## `NmbrOfS.cat3 Floors`     0.00
# We can avoid this normalization by setting the scale argument to FALSE
tree_model_importance <- caret::varImp(tree_model, scale = FALSE) 

# We can also visualize the most important features in our model this way...
plot(tree_model_importance, top = 5)


In the above chunk, we introduce the caret::varImp function, which provides insight into which of the features included in the model matter most for the resulting predictions. Without getting too much into the weeds, 🌲 permutation-based importance works by shuffling the values of a given feature and noting how much the prediction error worsens on average1.
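Here is a small base-R illustration of that permutation idea, using simulated data and lm() as a stand-in model. Note this is a sketch of the general concept rather than the exact computation caret::varImp performs for an rpart model:

```r
# Permutation importance sketch: shuffle one feature at a time, re-predict
# with the already-fitted model, and record how much the error worsens;
# the toy data below are simulated for illustration
set.seed(42)
n   <- 500
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 5 * toy$x1 + 0.2 * toy$x2 + rnorm(n, sd = 0.3)

fit      <- lm(y ~ x1 + x2, data = toy)
base_mse <- mean((toy$y - predict(fit, toy))^2)

perm_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled      <- toy
  shuffled[[v]] <- sample(shuffled[[v]])  # break the feature-outcome link
  mean((toy$y - predict(fit, shuffled))^2) - base_mse
})

perm_importance  # x1, the strong predictor, dominates
```

Shuffling x1 destroys most of the model's predictive skill, so its error increase (and hence importance) dwarfs that of the weak predictor x2.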

We fit the model on the training dataset in an earlier code chunk; in the code chunk below, we apply it to the test dataset using the predict function. The MAE and the Absolute Percent Error are then calculated based on the known sales prices AND the predicted sale prices from the model.


cville_testing_no_geometry <-
  cville_testing %>%
  st_drop_geometry() %>%
  drop_na() %>%
  dplyr::select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, LotAcrs, Fireplc, 
                                                NmbrOfS.cat, NumAggAssaults, PrclNmb)

tree_model_testing <- predict(tree_model, cville_testing_no_geometry)

outputPredictions <- cville_testing_no_geometry %>%
  mutate(                          
         SalePrice_AbsError = abs(tree_model_testing - cville_testing_no_geometry$SalAmnt),
         SalePrice_APE = ((abs(tree_model_testing - cville_testing_no_geometry$SalAmnt)) / tree_model_testing) * 100) %>%
  filter(SalAmnt < 5000000)


# Assess accuracy...
mean(outputPredictions$SalePrice_AbsError, na.rm = T)
## [1] 112276.6
mean(outputPredictions$SalePrice_APE, na.rm = T)
## [1] 23.44235
# Map model performance...

cville_testing_to_plot <- left_join(cville_testing, outputPredictions %>%
                                      select(PrclNmb, SalePrice_AbsError, SalePrice_APE), 
                                    by = "PrclNmb")

tmap_mode("view") 
## tmap mode set to interactive viewing
cville_testing_to_plot <- st_make_valid(cville_testing_to_plot) %>%
  drop_na(SalePrice_AbsError)

tm_shape(nhoods_utm_sf, name = "Neighborhoods") + 
  tm_borders(col = "dodgerblue") + 
  tm_shape(cville_testing_to_plot, "Prediction Error of Simple Regression Tree Model") + 
  tm_dots(shape = 21, col = "SalePrice_AbsError", palette = "viridis", 
          style = "quantile", colorNA = "orange", textNA = "Unknown", title = "Absolute Error ($)",
          popup.vars = c("Sale Amount: " = "SalAmnt", "Square Footage: " = "SqrFtFL", 
                         "Lot (Acres): " = "LotAcrs", "% Prediction Error: " = "SalePrice_APE"),
          legend.format = list(fun = function(x) paste0("$", formatC(x, digits = 0, big.mark=',', format = "f"))),
          id = "PrclNmb") +
  tmap_options(check.and.fix = TRUE) + 
  tm_view(set.view = c(-78.4837122981158, 38.03437353596851, 13)) + 
  tm_basemap(c("Esri.WorldGrayCanvas", "CartoDB.Positron", "Esri.WorldTopoMap", "CartoDB.DarkMatter"))


These metrics and the interactive map give us a sense of how well the model performs when applied to new data. Celebrate! 🥳


fancyRpartPlot(tree_model$finalModel, main = "Simple Decision Tree Model", sub = "Charlottesville Housing Sales", digits = -3)


The code chunk above uses a new R package called rattle to visually display the decision tree created when we trained our model. Recall that decision trees are “grown” and read from the top to the bottom. Although the cville_training set contains 1546 observations, the model is fit using 1233 of them due to the filtering of missing data values in a prior chunk. We can see in the figure that the first split is based on the SqrFtFL feature (i.e., the square footage of the home) and branches to the left if an observation is less than 2,816 square feet and to the right otherwise. From there, subsequent splits draw on other important features like FllBthr (i.e., number of full bathrooms), LotAcrs (i.e., size of the lot), and Age (i.e., how old the home is).


At each internal node (i.e., between the root node and the leaves or terminal nodes), the plot displays the number and percentage of observations that “flow” through that node of the decision tree, as well as what the model would predict as the sales price.
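In effect, each observation's prediction is the mean sale price of the training observations that land in the same leaf. A hand-rolled sketch of just the root split makes this mechanical: the 2,816 square foot threshold comes from the fitted tree shown above, but the leaf means below are invented placeholder values, not numbers read off our model:

```r
# Toy illustration of how a regression tree turns a split into predictions;
# left_mean and right_mean are hypothetical placeholder leaf values
predict_root_split <- function(sqft,
                               left_mean  = 350000,   # hypothetical leaf mean
                               right_mean = 900000) { # hypothetical leaf mean
  ifelse(sqft < 2816, left_mean, right_mean)
}

predict_root_split(c(1500, 3200))  # first home branches left, second right
```

The full tree simply repeats this branching logic at every internal node until a terminal node (leaf) is reached.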


From A Single Decision Tree To A Random Forest


In this section of the Assignment, we train a model using the random forest algorithm from the randomForest package by setting the method argument of the caret::train function to “rf” rather than “rpart” which trains a simple regression tree model. So what distinguishes the random forest algorithm from a simple decision tree algorithm? Before we can answer that question, we have to introduce the concept of bagging (more information in the next class session):

Bootstrap aggregating or bagging “creates b new bootstrap samples by drawing samples with replacement of the original training data. Bagging is one of the first ensemble algorithms machine learning practitioners learn and is designed to improve the stability and accuracy of regression and classification algorithms. By model averaging, bagging helps to reduce variance and minimize overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method.”

Boehmke & Greenwell (2020)

A random forest is a special type of bagged ensemble model where a large number of individual decision trees (e.g., hundreds or thousands) are generated, but these individual trees are de-correlated. This means that each time a split is to be performed, the search for the split variable is limited to a random subset of size mtry (i.e., one of the algorithm’s hyperparameters) drawn from the total number of features available.
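A toy base-R sketch of those two sources of randomness may help; the feature names are borrowed from our model, but the numbers are illustrative:

```r
# (1) Bagging draws a bootstrap sample of rows; (2) each split then
# considers only a random subset of mtry features
set.seed(42)
features <- c("SqrFtFL", "Bedroms", "FllBthr", "Age", "LotAcrs",
              "Fireplc", "NmbrOfS.cat", "NumAggAssaults")
n <- 1000

# (1) Sample rows with replacement; about 63 percent of the original
# observations appear in any one bootstrap sample
boot_rows  <- sample(n, n, replace = TRUE)
pct_unique <- length(unique(boot_rows)) / n
pct_unique

# (2) De-correlation: at each split, search only mtry randomly chosen
# features rather than all of them
mtry <- 5
sample(features, mtry)
```

Because different trees see different rows and consider different features at each split, their errors are less correlated, and averaging their predictions reduces variance more than bagging identical trees would.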

So why is the random forest algorithm popular?

# Set up for standard, 10 fold cross-validation...
tree_ctrl <- trainControl(method = "cv", number = 10)

rf_model <- train(SalAmnt ~ SqrFtFL + Bedroms + FllBthr + Age + LotAcrs + Fireplc + NmbrOfS.cat + NumAggAssaults, 
                                  data = cville_training %>% 
                                  st_drop_geometry() %>%
                                  drop_na() %>%
                                  dplyr::select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, LotAcrs, Fireplc, 
                                                NmbrOfS.cat, NumAggAssaults, PrclNmb, NeighHood, lagSalAmnt),
                    method = "rf",
                    trControl = tree_ctrl,
                    tuneLength = 10)
## note: only 8 unique complexity parameters in default grid. Truncating the grid to 8 .
print(rf_model)
## Random Forest 
## 
## 1233 samples
##    8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1110, 1110, 1109, 1110, 1110, 1110, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE     
##   2     118092.25  0.8785299  76788.00
##   3      90849.46  0.9133282  52633.81
##   4      87215.62  0.9182268  47169.80
##   5      86392.53  0.9192250  45268.06
##   6      87074.39  0.9176254  44811.01
##   7      86865.48  0.9180409  44764.84
##   8      87816.69  0.9161499  44822.99
##   9      89866.89  0.9121350  45509.27
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
varImp(rf_model, scale = FALSE)
## rf variable importance
## 
##                        Overall
## SqrFtFL              4.183e+13
## Age                  2.085e+13
## LotAcrs              1.622e+13
## FllBthr              1.116e+13
## Fireplc              8.941e+12
## Bedroms              7.819e+12
## NumAggAssaults       4.168e+12
## NmbrOfS.cat4+ Floors 0.000e+00
## NmbrOfS.cat3 Floors  0.000e+00
rf_model_importance <- varImp(rf_model, scale = TRUE)
plot(rf_model_importance, top = 5)


You should also recall that caret supports over 230 machine learning algorithms, and there are multiple variations on the random forest algorithm. As shown in the documentation, there are multiple options for the method argument of the caret::train function that train a random forest, but the one we have selected here has a single hyperparameter—it is called mtry and represents the number of features to consider at each split point. This means that values of mtry may range between 1 and the total number of features in your model. Another approach would be to use method = 'ranger' to train a random forest using the ranger package, which allows us to tune two additional hyperparameters—splitrule and min.node.size. You can find more information on these two hyperparameters in the documentation of the R package that caret calls in the background, which is ranger in that alternate case but randomForest in our present example.
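For illustration, here is a hedged sketch of what a tuning grid for the ranger variant might look like. The grid values are arbitrary choices, and the commented train call assumes the caret and ranger packages are installed; expand.grid itself is base R:

```r
# Hypothetical tuning grid for method = "ranger"; the column names must
# match the hyperparameters caret exposes for that method, and the
# candidate values here are illustrative
ranger_grid <- expand.grid(
  mtry          = c(3, 5, 7),                   # features tried at each split
  splitrule     = c("variance", "extratrees"),  # split criterion (regression)
  min.node.size = c(5, 10),                     # minimum observations per leaf
  stringsAsFactors = FALSE
)
nrow(ranger_grid)  # 3 x 2 x 2 = 12 hyperparameter combinations

# rf_ranger <- train(SalAmnt ~ SqrFtFL + Bedroms + FllBthr + Age + LotAcrs,
#                    data = cville_training %>% st_drop_geometry() %>% drop_na(),
#                    method = "ranger", trControl = tree_ctrl,
#                    tuneGrid = ranger_grid)
```

Passing an explicit tuneGrid replaces the tuneLength shortcut used earlier, so cross-validation evaluates exactly these 12 combinations.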


cville_testing_no_geometry <-
  cville_testing %>%
  st_drop_geometry() %>%
  drop_na() %>%
  dplyr::select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, LotAcrs, Fireplc, 
                                                NmbrOfS.cat, NumAggAssaults, PrclNmb, NeighHood, lagSalAmnt)

rf_model_testing <- predict(rf_model, cville_testing_no_geometry)

outputPredictions_rf <- cville_testing_no_geometry %>%
  mutate(                          
         SalePrice_AbsError = abs(rf_model_testing - cville_testing_no_geometry$SalAmnt),
         SalePrice_APE = ((abs(rf_model_testing - cville_testing_no_geometry$SalAmnt)) / rf_model_testing) * 100) %>%
  filter(SalAmnt < 5000000)


# Assess accuracy...
mean(outputPredictions_rf$SalePrice_AbsError, na.rm = T)
## [1] 38358.15
mean(outputPredictions_rf$SalePrice_APE, na.rm = T)
## [1] 9.123643
# Create interactive map of RF model errors...

cville_testing_rf_to_plot <- left_join(cville_testing, outputPredictions_rf %>%
                                      select(PrclNmb, SalePrice_AbsError, SalePrice_APE), 
                                    by = "PrclNmb")
tmap_mode("view") 
## tmap mode set to interactive viewing
cville_testing_rf_to_plot <- st_make_valid(cville_testing_rf_to_plot) %>%
  drop_na(SalePrice_AbsError)

tm_shape(nhoods_utm_sf, name = "Neighborhoods") + 
  tm_borders(col = "dodgerblue") + 
  tm_shape(cville_testing_rf_to_plot, "Prediction Error of Random Forest Model") + 
  tm_dots(shape = 21, col = "SalePrice_AbsError", palette = "inferno", 
          style = "quantile", colorNA = "orange", textNA = "Unknown", title = "Absolute Error ($)",
          popup.vars = c("Sale Amount: " = "SalAmnt", "Square Footage: " = "SqrFtFL", 
                         "Lot (Acres): " = "LotAcrs", "% Prediction Error: " = "SalePrice_APE"),
          legend.format = list(fun = function(x) paste0("$", formatC(x, digits = 0, big.mark=',', format = "f"))),
          id = "PrclNmb") +
  tmap_options(check.and.fix = TRUE) + 
  tm_view(set.view = c(-78.4837122981158, 38.03437353596851, 13)) + 
  tm_basemap(c("Esri.WorldGrayCanvas", "CartoDB.Positron", "Esri.WorldTopoMap", "CartoDB.DarkMatter"))

Accounting for Neighborhood

Sometimes real estate characteristics are constant across properties in the same neighborhood (e.g., access to a common swimming pool or walking trails), and the code chunk below uses the name of the neighborhood as a proxy for all of those unobserved characteristics that vary across neighborhoods in Charlottesville but are the same for homes located within a given neighborhood. Let’s use the rstatix package to quickly get a sense of how well the random forest model performed across neighborhoods in Charlottesville. 💪


kbl(
cville_testing_rf_to_plot %>%
  st_drop_geometry() %>%
  group_by(NeighHood) %>%
  rstatix::get_summary_stats(SalePrice_AbsError),
  digits = 3, align = "lccrr", caption = "Observed Versus Predicted Sales Price by Neighborhood") %>%
  kable_paper("hover", full_width = TRUE)
Observed Versus Predicted Sales Price by Neighborhood
NeighHood variable n min max median q1 q3 iqr mad mean sd se ci
Greenbrier SalePrice_AbsError 17 249.356 70381.808 39800.595 2284.537 43738.660 41454.123 45339.706 30916.427 28581.233 6931.967 14695.114
Rutledge SalePrice_AbsError 36 379.542 66471.315 15046.934 5356.201 60220.082 54863.881 21745.875 29657.790 27136.283 4522.714 9181.597
Angus Road Area SalePrice_AbsError 5 6622.957 10776.871 6622.957 6622.957 6622.957 0.000 0.000 7453.740 1857.687 830.783 2306.623
Meadowbrook Hills SalePrice_AbsError 21 1345.967 31338.827 5184.833 5184.833 5184.833 0.000 0.000 6415.810 5828.575 1271.899 2653.136
Greenleaf Terrace/Rose Hill/Rugby Hills SalePrice_AbsError 42 1215.864 51243.833 22816.076 3951.324 46074.093 42122.768 29011.650 24444.332 18450.504 2846.975 5749.582
Woodhayven SalePrice_AbsError 1 9807.116 9807.116 9807.116 9807.116 9807.116 0.000 0.000 9807.116 NA NA NaN
Rugby SalePrice_AbsError 27 195.788 169282.479 17813.261 195.788 59469.325 59273.537 26119.665 42756.975 58181.255 11196.988 23015.738
Holmes/North Avenue Area SalePrice_AbsError 8 6934.634 36678.409 21806.522 6934.634 36678.409 29743.774 22049.060 21806.522 15898.716 5621.045 13291.659
St. Charles Place SalePrice_AbsError 2 10835.527 11598.529 11217.028 11026.277 11407.779 381.501 565.614 11217.028 539.524 381.501 4847.432
Locust Grove Extended SalePrice_AbsError 24 604.147 73833.170 1873.459 604.147 66583.415 65979.268 1881.883 27479.523 34287.945 6998.998 14478.529
Davis Avenue/Marshall Street Area SalePrice_AbsError 4 59020.924 59020.924 59020.924 59020.924 59020.924 0.000 0.000 59020.924 0.000 0.000 0.000
Towles/Merryden/Ivy Terrace SalePrice_AbsError 11 2312.841 63195.617 40560.267 19256.168 54985.717 35729.548 31585.456 36639.660 21104.442 6363.229 14178.157
North Downtown SalePrice_AbsError 43 1738.105 529309.598 33638.261 6776.798 58632.390 51855.593 37661.666 89877.056 152930.049 23321.614 47064.922
Rose Hill/Forrest Street SalePrice_AbsError 9 6366.080 6366.080 6366.080 6366.080 6366.080 0.000 0.000 6366.080 0.000 0.000 0.000
Venable SalePrice_AbsError 2 10026.958 74985.479 42506.219 26266.589 58745.849 32479.260 48153.751 42506.219 45932.610 32479.260 412688.130
Venable/Page/10th Street SalePrice_AbsError 12 6275.865 89731.829 19860.268 6275.865 89731.829 83455.964 20140.235 38472.386 38418.777 11090.546 24410.126
Locust Grove SalePrice_AbsError 16 17562.550 202033.425 17562.550 17562.550 80110.065 62547.515 0.000 65891.669 81405.432 20351.358 43377.893
Court Square/Central Business District SalePrice_AbsError 1 10886.233 10886.233 10886.233 10886.233 10886.233 0.000 0.000 10886.233 NA NA NaN
Golf Club SalePrice_AbsError 18 344.225 72068.362 20476.032 2186.013 25155.130 22969.117 26065.951 25111.585 27359.326 6448.655 13605.473
Little High Street/East Jefferson Street SalePrice_AbsError 4 610.215 610.215 610.215 610.215 610.215 0.000 0.000 610.215 0.000 0.000 0.000
Fife Estate SalePrice_AbsError 24 563.092 75023.763 12686.934 6534.271 12686.934 6152.663 9121.938 20111.892 25642.142 5234.180 10827.727
University/Maury Hills SalePrice_AbsError 3 858.457 95073.172 28485.376 14671.917 61779.274 47107.357 40959.670 41472.335 48431.384 27961.872 120310.227
Forest Hills SalePrice_AbsError 10 1075.722 23398.777 17388.323 5043.669 23398.777 18355.108 8911.099 14474.764 9697.229 3066.533 6936.979
JPA/Shamrock Road SalePrice_AbsError 53 221.800 58426.491 7311.989 3688.137 17811.894 14123.758 5372.722 16079.507 18134.350 2490.944 4998.446
North Belmont SalePrice_AbsError 103 39.830 45543.778 39.830 39.830 39.830 0.000 0.000 5420.879 13827.652 1362.479 2702.471
Ix/Belmont SalePrice_AbsError 182 32.933 8799.333 33.673 32.933 33.673 0.740 0.549 313.731 1226.083 90.883 179.327
Johnson Village SalePrice_AbsError 4 7613.389 21283.916 18509.295 15198.487 19789.782 4591.295 2636.851 16478.974 6087.005 3043.503 9685.784
Carlton/Belmont SalePrice_AbsError 29 9934.164 69899.817 53100.183 23794.371 69899.817 46105.446 24907.138 46896.307 23740.213 4408.447 9030.294
Burnet Commons SalePrice_AbsError 10 4878.703 59167.013 12039.285 12039.285 59167.013 47127.728 10014.081 29539.495 25632.209 8105.616 18336.178
Belmont SalePrice_AbsError 88 83.357 257055.491 98948.683 23443.922 131051.317 107607.395 107965.155 110024.178 91335.858 9736.435 19352.219
Ridge Street SalePrice_AbsError 18 770.982 94259.708 75740.292 62278.215 94259.708 31981.493 27456.887 65665.669 30569.749 7205.359 15201.978
Huntley SalePrice_AbsError 4 6864.159 17590.206 13034.925 11044.744 14621.234 3576.490 3819.129 12631.054 4430.518 2215.259 7049.944
Fry’s Spring SalePrice_AbsError 30 2429.874 34684.884 34176.672 7206.258 34176.672 26970.414 0.000 23094.293 14100.827 2574.447 5265.335
Azalea Gardens/Green Valley SalePrice_AbsError 18 332.018 23607.881 2090.738 516.071 8331.971 7815.900 2607.479 4577.979 5932.374 1398.274 2950.100
Brookwood SalePrice_AbsError 13 132.871 43844.981 17734.470 132.871 43844.981 43712.110 26096.131 19233.632 18832.531 5223.204 11380.385
Willoughby SalePrice_AbsError 1 9863.590 9863.590 9863.590 9863.590 9863.590 0.000 0.000 9863.590 NA NA NaN
Lochlyn Hill SalePrice_AbsError 68 1131.164 356479.208 70665.948 54334.052 291266.651 236932.599 86355.858 151871.649 131158.400 15905.292 31747.096
Stonehenge Extended SalePrice_AbsError 94 3898.536 25350.974 14320.451 12089.026 25350.974 13261.948 3474.030 17435.348 6021.983 621.120 1233.421


The above code chunk enlists the kableExtra::kbl function to gently format a table for us. You can learn more about this function here, but I have not found table formatting to be a particularly productive place to channel effort. Packages like formattable and pixiedust are reviewed at the links below if you want to take a deeper dive into table formatting:
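To make the workflow concrete, here is a minimal sketch of how a table like the one above can be built and passed to kableExtra::kbl. It assumes a test-set data frame with the NeighHood and SalePrice_AbsError columns shown above; the object name cville_test is an assumption, so adjust it to match your own code:

```r
# Hedged sketch: `cville_test` is an assumed name for the test set holding
# the SalePrice_AbsError column summarized in the table above.
library(dplyr)
library(kableExtra)

cville_test %>%
  group_by(NeighHood) %>%
  summarize(n = n(),
            mean_abs_error = mean(SalePrice_AbsError, na.rm = TRUE)) %>%
  arrange(desc(mean_abs_error)) %>%
  kbl(digits = 0, caption = "Mean Absolute Error by Neighborhood") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```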

As outlined in greater detail here, the neighborhood effects model not only fits our data better, it is also likely to perform better when it “sees new data,” and its accuracy is more consistent across neighborhoods. This speaks directly to Steif’s definition of generalizability, which consists of:

“… the ability to predict accurately on new data - which was the focus of cross-validation in Chapter 3” and “…the ability to predict with comparable accuracy across different group contexts, like neighborhoods” otherwise, “…the algorithm may not be fair” and frankly not very useful…

Recall the plot of RMSE and R-Squared across the k-Folds validation from Lab Exercise #6 that we can access like this: rf_model$resample. Those plots speak to the first aspect of generalizability as articulated here—whether the model performs at a comparable level when it “sees new data” that have not been used to train and test it. The table displayed here instead addresses the second aspect of generalizability and provides insight into whether the model’s accuracy is consistent across neighborhoods.
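As a quick refresher, the fold-by-fold metrics can be pulled out and plotted along these lines. This sketch assumes rf_model is a caret::train object fit with k-fold cross-validation, as in Lab Exercise #6:

```r
# Hedged sketch: `rf_model` is assumed to be a caret::train() object fit
# with trainControl(method = "cv") in Lab Exercise #6.
library(caret)
library(ggplot2)

rf_model$resample   # RMSE, Rsquared, and MAE for each held-out fold

# Visualize fold-to-fold variability in RMSE
ggplot(rf_model$resample, aes(x = Resample, y = RMSE)) +
  geom_col() +
  labs(title = "RMSE Across k-Folds", x = "Fold", y = "RMSE")
```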

When you have read and understand what the preceding code chunks are doing, please proceed… 👑


Exercise


This exercise asks you to experiment with two features that we created, but that we have not yet included in our models—NeighHood and lagSalAmnt. By drawing on code in the three preceding code chunks, write your own code that trains and tests a random forest model that includes NeighHood and an alternate that includes lagSalAmnt, then respond to the questions posed below:

  1. For your random forest model that includes NeighHood as a feature:
    • Which Charlottesville neighborhoods are most influential in predicting sales price?
    • Does adding this feature improve model performance relative to the results from the two preceding code chunks?
  2. For your random forest model that includes lagSalAmnt as a feature:
    • Does adding this feature improve model performance relative to the results from the two preceding code chunks?
    • How might you explain what this feature is capturing?
  3. Choose one of the two random forest models you have trained and tested.
    • How well did your model perform across neighborhoods? Are there specific neighborhoods where your selected model performs best? Are there specific neighborhoods where it performs worst? Hint: take another look at the “Inspect Sale Price Prediction Error By Neighborhood” chunk above.
    • Can you think of some additional features that might help improve the performance of your model?
    • Briefly describe the steps that would need to be taken in order to apply your selected model to Staunton, Fredericksburg, Roanoke, etc.
  4. If the random forest algorithm has an Achilles’ heel, it is the presence of correlated features. If we run a line of code like DataExplorer::plot_correlation(cville_training %>% drop_na() %>% select(SalAmnt, SqrFtFL, Bedroms, FllBthr, Age, Fireplc, LotAcrs, NumAggAssaults)), we see that the presence of a fireplace is highly correlated with sales price AND with square footage of the home. Sometimes we can achieve addition through subtraction 🤣 so let’s see if this is one of those times!
    • Does removing Fireplc as a feature improve the performance of your model?
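To get you started on the questions above, here is a template (not a solution) for training and evaluating a random forest that adds NeighHood as a feature. It assumes the training and test splits are named cville_training and cville_test; cville_training appears in Question 4, but the test-set name is an assumption, so adapt both to your own objects and drop or swap predictors as needed:

```r
# Template sketch: object names `cville_training` / `cville_test` and the
# predictor list are assumptions drawn from the variables named in Question 4.
library(caret)
library(tidyverse)

set.seed(717)
rf_nbhd <- train(
  SalAmnt ~ SqrFtFL + Bedroms + FllBthr + Age + Fireplc + LotAcrs +
    NumAggAssaults + NeighHood,        # NeighHood added as a feature
  data = cville_training %>% drop_na(),
  method = "rf",
  trControl = trainControl(method = "cv", number = 5),
  importance = TRUE
)

varImp(rf_nbhd)   # which features (and neighborhoods) are most influential?

# Evaluate on the held-out test set
test_clean <- cville_test %>% drop_na()
postResample(predict(rf_nbhd, test_clean), test_clean$SalAmnt)
```

The same pattern works for the lagSalAmnt model: substitute lagSalAmnt for NeighHood in the formula and compare the resulting RMSE and R-Squared against the models from the preceding code chunks.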



Work Products

Please submit an R Notebook and knitted HTML file that shows your work and responses for the Exercise included in this Assignment. Also, briefly comment on your experience with R this week (i.e., provide the requested Reflective Content described in the rubric below). Please upload your report to Collab by 5:00 pm on Friday, April 21st.


Assessment Rubric

This Lab Exercise will be graded on a 100-point scale according to the rubric below:

Length and formatting (10 pts)

  • Are the Lab Exercise responses provided in an acceptable format (e.g., R Notebook, rendered HTML file, etc.)?
  • Is there enough explanatory text to evaluate the work?

Clarity of writing and attention to detail (20 pts)

  • Is the text component clearly written? Please do not rely too heavily on bulleted lists.
  • Are there more than one or two grammatical or typographical errors? Please perform a spelling/grammar check prior to submission.

Technical Content (45 pts)

  • Are the requested graphics, tables, etc. included and intelligible?
  • Does the submission explicitly and thoroughly respond to any questions posed?
  • Please explain why you reached the conclusions you did when responding to the questions posed.

Reflective Content (25 pts)

  • Does the response reflect on the procedures used (i.e., what am I clicking and why?)?
  • Is there evidence that the student understands how the substance of the Lab Exercise relates to concepts from the lectures/readings and/or how the substance of the Lab Exercise might be applied in the work planners (or you personally) do?

© Bev Wilson 2023 | Department of Urban + Environmental Planning | University of Virginia



  1. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.↩︎